We use factor analysis to reduce a set of variables to a smaller set of summary variables. While clustering grouped rows, factor analysis groups columns. As with clustering, though, we want to know whether the new variables can be given a name; technically, each of them is called a latent variable. In this session we will explore the data to see what emerges.
For this session we will work with the data from these links:
library(htmltab)
# links
happyL=c("https://en.wikipedia.org/wiki/World_Happiness_Report",
'//*[@id="mw-content-text"]/div/table/tbody')
demoL=c("https://en.wikipedia.org/wiki/Democracy_Index",
'//*[@id="mw-content-text"]/div/table[2]/tbody')
# load
happy = htmltab(doc = happyL[1],which = happyL[2],encoding = "UTF-8")
demo = htmltab(doc = demoL[1], which = demoL[2], encoding = "UTF-8")
# cleaning
happy[,]=lapply(happy[,], trimws,whitespace = "[\\h\\v]") # no blanks
demo[,]=lapply(demo[,], trimws,whitespace = "[\\h\\v]") # no blanks
library(stringr) # simple names
names(happy)=str_split(names(happy)," ",simplify = TRUE)[,1]
names(demo)=str_split(names(demo)," ",simplify = TRUE)[,1]
## Formatting
# Drop the columns we will not use this time:
happy$Overall=NULL
demo[,c(1,9,10)]=NULL
# The score columns also need different names before the merge:
names(happy)[names(happy)=="Score"]="ScoreHappy"
names(demo)[names(demo)=="Score"]="ScoreDemo"
# Variable types:
## In demo:
demo[,-c(1)]=lapply(demo[,-c(1)],as.numeric)
## In happy:
happy[,-c(1)]=lapply(happy[,-c(1)],as.numeric)
# no missing values:
happy=na.omit(happy)
demo=na.omit(demo)
Pay attention to the merge. We usually run merge with its defaults, which silently drops every row whose key does not match:
nrow(merge(happy,demo))
## [1] 147
Let's run a new merge that keeps ALL the countries, including those that were missing from one data frame or the other:
HappyDemo=merge(happy,demo,all.x=T, all.y=T)
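The difference between the two merges can be sketched with toy data. The data frames `left` and `right` and all their values are hypothetical, made up only for illustration; `all = TRUE` is base R shorthand for `all.x = TRUE, all.y = TRUE`.

```r
# Toy data frames (hypothetical values) sharing the key column "Country"
left  <- data.frame(Country    = c("Aland", "Borduria", "Old Name"),
                    ScoreHappy = c(5.7, 6.4, 4.2))
right <- data.frame(Country   = c("Aland", "Borduria", "New Name"),
                    ScoreDemo = c(6.6, 8.0, 3.0))

inner <- merge(left, right)              # default: only matching keys survive
outer <- merge(left, right, all = TRUE)  # unmatched rows are kept, padded with NA

nrow(inner)
nrow(outer)
outer[!complete.cases(outer), ]          # the rows that failed to match
```

The incomplete rows of the outer merge are exactly the candidates for a name fix.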
This time HappyDemo has several extra countries, but they come with missing values and with names that could not be matched. Let's see:
# formatted version of:
# HappyDemo[!complete.cases(HappyDemo),]
library(knitr)
library(kableExtra)
kable(HappyDemo[!complete.cases(HappyDemo),],format='html')%>% # kable's argument is 'format', not 'type'
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
font_size = 10)
| | Country | ScoreHappy | GDP | Social | Healthy | Freedom | Generosity | Perceptions | ScoreDemo | Electoral | Functioning | Politicalparticipation | Politicalculture | Civilliberties |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Angola | NA | NA | NA | NA | NA | NA | NA | 3.62 | 1.75 | 2.86 | 5.56 | 5.00 | 2.94 |
| 26 | Cape Verde | NA | NA | NA | NA | NA | NA | NA | 7.88 | 9.17 | 7.86 | 6.67 | 6.88 | 8.82 |
| 33 | Congo (Brazzaville) | 4.812 | 0.673 | 0.799 | 0.508 | 0.372 | 0.105 | 0.093 | NA | NA | NA | NA | NA | NA |
| 34 | Congo (Kinshasa) | 4.418 | 0.094 | 1.125 | 0.357 | 0.269 | 0.212 | 0.053 | NA | NA | NA | NA | NA | NA |
| 37 | Cuba | NA | NA | NA | NA | NA | NA | NA | 3.00 | 1.08 | 3.57 | 3.33 | 4.38 | 2.65 |
| 40 | Democratic Republic of the Congo | NA | NA | NA | NA | NA | NA | NA | 1.49 | 0.50 | 0.71 | 2.22 | 3.13 | 0.88 |
| 42 | Djibouti | NA | NA | NA | NA | NA | NA | NA | 2.87 | 0.42 | 1.79 | 3.89 | 5.63 | 2.65 |
| 47 | Equatorial Guinea | NA | NA | NA | NA | NA | NA | NA | 1.92 | 0.00 | 0.43 | 3.33 | 4.38 | 1.47 |
| 48 | Eritrea | NA | NA | NA | NA | NA | NA | NA | 2.37 | 0.00 | 2.14 | 1.67 | 6.88 | 1.18 |
| 50 | Eswatini | NA | NA | NA | NA | NA | NA | NA | 3.03 | 0.92 | 2.86 | 2.22 | 5.63 | 3.53 |
| 52 | Fiji | NA | NA | NA | NA | NA | NA | NA | 5.85 | 6.58 | 5.36 | 6.11 | 5.63 | 5.59 |
| 63 | Guinea-Bissau | NA | NA | NA | NA | NA | NA | NA | 1.98 | 1.67 | 0.00 | 2.78 | 3.13 | 2.35 |
| 64 | Guyana | NA | NA | NA | NA | NA | NA | NA | 6.67 | 9.17 | 5.71 | 6.11 | 5.00 | 7.35 |
| 83 | Kosovo | 6.100 | 0.882 | 1.232 | 0.758 | 0.489 | 0.262 | 0.006 | NA | NA | NA | NA | NA | NA |
| 115 | North Korea | NA | NA | NA | NA | NA | NA | NA | 1.08 | 0.00 | 2.50 | 1.67 | 1.25 | 0.00 |
| 117 | Northern Cyprus | 5.718 | 1.263 | 1.252 | 1.042 | 0.417 | 0.191 | 0.162 | NA | NA | NA | NA | NA | NA |
| 119 | Oman | NA | NA | NA | NA | NA | NA | NA | 3.04 | 0.00 | 3.93 | 2.78 | 4.38 | 4.12 |
| 121 | Palestine | NA | NA | NA | NA | NA | NA | NA | 4.39 | 3.83 | 2.14 | 7.78 | 4.38 | 3.82 |
| 122 | Palestinian Territories | 4.696 | 0.657 | 1.247 | 0.672 | 0.225 | 0.103 | 0.066 | NA | NA | NA | NA | NA | NA |
| 124 | Papua New Guinea | NA | NA | NA | NA | NA | NA | NA | 6.03 | 6.92 | 6.07 | 3.89 | 5.63 | 7.65 |
| 131 | Republic of the Congo | NA | NA | NA | NA | NA | NA | NA | 3.31 | 3.17 | 2.50 | 3.89 | 3.75 | 3.24 |
| 142 | Somalia | 4.668 | 0.000 | 0.698 | 0.268 | 0.559 | 0.243 | 0.270 | NA | NA | NA | NA | NA | NA |
| 145 | South Sudan | 2.853 | 0.306 | 0.575 | 0.295 | 0.010 | 0.202 | 0.091 | NA | NA | NA | NA | NA | NA |
| 148 | Sudan | NA | NA | NA | NA | NA | NA | NA | 2.15 | 0.00 | 1.79 | 2.78 | 5.00 | 1.18 |
| 149 | Suriname | NA | NA | NA | NA | NA | NA | NA | 6.98 | 9.17 | 6.43 | 6.67 | 5.00 | 7.65 |
| 150 | Swaziland | 4.212 | 0.811 | 1.149 | 0.000 | 0.313 | 0.074 | 0.135 | NA | NA | NA | NA | NA | NA |
| 158 | Timor-Leste | NA | NA | NA | NA | NA | NA | NA | 7.19 | 9.08 | 6.79 | 5.56 | 6.88 | 7.65 |
| 160 | Trinidad & Tobago | 6.192 | 1.231 | 1.477 | 0.713 | 0.489 | 0.185 | 0.016 | NA | NA | NA | NA | NA | NA |
| 161 | Trinidad and Tobago | NA | NA | NA | NA | NA | NA | NA | 7.16 | 9.58 | 7.14 | 6.11 | 5.63 | 7.35 |
From the above, notice that some countries are missing a whole block of indicators, and that in many cases the same country is simply written differently in the two tables. We can recover some of them, but the fix has to be made in the original data frames:
# switch to the names used by the other table:
## in demo, use the happy names
demo[demo$Country=="Democratic Republic of the Congo",'Country']="Congo (Kinshasa)"
demo[demo$Country=="Republic of the Congo",'Country']="Congo (Brazzaville)"
demo[demo$Country=="Trinidad and Tobago",'Country']="Trinidad & Tobago"
demo[demo$Country=="North Macedonia",'Country']="Macedonia"
## in happy, use the demo names
happy[happy$Country=="Palestinian Territories",'Country']="Palestine"
After those adjustments, let's check:
HappyDemo=merge(happy,demo) # re-creating HappyDemo
nrow(HappyDemo)
## [1] 150
The merge now keeps 150 countries, 3 more than the 147 we had with the original names. We will leave it at that.
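A quick way to see which names still fail to match is `setdiff` on the two key columns. The sketch below uses two short hypothetical vectors instead of the real `happy$Country` and `demo$Country`; "Aland" is a made-up name, while the other entries appear in the table of unmatched rows above.

```r
# Hypothetical key columns; with the real data these would be
# happy$Country and demo$Country
happy_names <- c("Aland", "Swaziland", "Kosovo")
demo_names  <- c("Aland", "Eswatini", "Cuba")

only_happy <- setdiff(happy_names, demo_names)  # present only in happy
only_demo  <- setdiff(demo_names, happy_names)  # present only in demo

only_happy
only_demo
```

Pairs such as Swaziland and Eswatini, two names for the same country, show up one on each side and can be renamed in the same way as the cases above.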
Factor analysis requires a few preliminary checks.
theData=HappyDemo[,-c(1,2,9)] # without the Scores or the country name.
# its correlation matrix:
library(polycor)
corMatrix=polycor::hetcor(theData)$correlations
library(ggcorrplot)
ggcorrplot(corMatrix)
ggcorrplot(corMatrix,
p.mat = cor_pmat(theData), # p-values are computed from the data, not from the correlation matrix
insig = "blank")
If you can see blocks of correlated variables, there is hope for a good factor analysis.
library(psych)
psych::KMO(corMatrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix)
## Overall MSA = 0.86
## MSA for each item =
## GDP Social Healthy
## 0.84 0.90 0.88
## Freedom Generosity Perceptions
## 0.82 0.59 0.77
## Electoral Functioning Politicalparticipation
## 0.80 0.91 0.94
## Politicalculture Civilliberties
## 0.89 0.85
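The MSA values compare squared correlations with squared partial correlations: the closer to 1, the more of the correlation is shared and the better suited the data are to factoring. The following is a minimal base R sketch of that computation on simulated data (the data and all variable names are hypothetical); it is meant to show the idea, not to replace `psych::KMO`.

```r
set.seed(1)
# Simulated data (hypothetical): six indicators driven by one common factor
n <- 200
f <- rnorm(n)
X <- sapply(1:6, function(i) 0.7 * f + rnorm(n, sd = 0.7))
R <- cor(X)

# Partial correlation of each pair, controlling for all other variables
Rinv <- solve(R)
P <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))

off <- row(R) != col(R)      # mask that selects off-diagonal cells
r2  <- R^2 * off             # squared correlations
p2  <- P^2 * off             # squared partial correlations

overall_msa <- sum(r2) / (sum(r2) + sum(p2))
item_msa    <- rowSums(r2) / (rowSums(r2) + rowSums(p2))

overall_msa
round(item_msa, 2)
```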
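Two more checks follow, and both should return FALSE: Bartlett's test asks whether the correlation matrix could be an identity matrix, and the second check asks whether it is singular: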
cortest.bartlett(corMatrix,n=nrow(theData))$p.value>0.05
## [1] FALSE
library(matrixcalc)
is.singular.matrix(corMatrix)
## [1] FALSE
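Both checks can be approximated by hand from the determinant of the correlation matrix. This is a sketch on simulated data (hypothetical values and names), using the usual chi-square approximation for Bartlett's test of sphericity and a determinant tolerance of 1e-8 chosen here as an assumption.

```r
set.seed(2)
# Simulated data (hypothetical): five indicators sharing one common factor
n <- 150
f <- rnorm(n)
X <- sapply(1:5, function(i) 0.6 * f + rnorm(n, sd = 0.8))
R <- cor(X)
p <- ncol(X)

# Bartlett's test of sphericity: H0 says R is an identity matrix
chisq   <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df      <- p * (p - 1) / 2
p_value <- pchisq(chisq, df, lower.tail = FALSE)
p_value > 0.05            # FALSE means the identity hypothesis is rejected

# Singularity: a determinant that is numerically zero (tolerance is an assumption)
abs(det(R)) < 1e-8        # FALSE means R can be inverted
```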
fa.parallel(theData,fm = 'ML', fa = 'fa')
## Parallel analysis suggests that the number of factors = 3 and the number of components = NA
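The logic of parallel analysis is to keep only the dimensions whose eigenvalues exceed what random, uncorrelated data of the same size would produce. The sketch below is a simplified, component-based variant on simulated data (all names and values are hypothetical); `fa.parallel` with `fa = 'fa'` works on factor eigenvalues instead, so its numbers will differ.

```r
set.seed(3)
n <- 300
p <- 8
# Simulated data (hypothetical): two blocks of four indicators,
# each block driven by its own latent variable
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(sapply(1:4, function(i) 0.8 * f1 + rnorm(n, sd = 0.6)),
           sapply(1:4, function(i) 0.8 * f2 + rnorm(n, sd = 0.6)))

obs_eig <- eigen(cor(X), only.values = TRUE)$values

# Eigenvalues of 100 random data sets with the same number of rows and columns
sim_eig <- replicate(100,
  eigen(cor(matrix(rnorm(n * p), n, p)), only.values = TRUE)$values)
threshold <- rowMeans(sim_eig)

# Count the leading eigenvalues that beat the random benchmark
n_keep <- sum(cumprod(obs_eig > threshold))
n_keep
```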
fa.parallel suggests 3 factors for our data; let's fit them:
library(GPArotation)
resfa <- fa(theData,nfactors = 3,cor = 'mixed',rotate = "varimax",fm="minres")
##
## mixed.cor is deprecated, please use mixedCor.
print(resfa$loadings)
##
## Loadings:
## MR1 MR3 MR2
## GDP 0.275 0.889 0.105
## Social 0.326 0.730
## Healthy 0.346 0.829 0.133
## Freedom 0.195 0.348 0.498
## Generosity -0.139 0.586
## Perceptions 0.253 0.684
## Electoral 0.938 0.169
## Functioning 0.752 0.446 0.334
## Politicalparticipation 0.721 0.325 0.102
## Politicalculture 0.545 0.308 0.502
## Civilliberties 0.915 0.318
##
## MR1 MR3 MR2
## SS loadings 3.443 2.744 1.477
## Proportion Var 0.313 0.249 0.134
## Cumulative Var 0.313 0.562 0.697
print(resfa$loadings,cutoff = 0.51)
##
## Loadings:
## MR1 MR3 MR2
## GDP 0.889
## Social 0.730
## Healthy 0.829
## Freedom
## Generosity 0.586
## Perceptions 0.684
## Electoral 0.938
## Functioning 0.752
## Politicalparticipation 0.721
## Politicalculture 0.545
## Civilliberties 0.915
##
## MR1 MR3 MR2
## SS loadings 3.443 2.744 1.477
## Proportion Var 0.313 0.249 0.134
## Cumulative Var 0.313 0.562 0.697
When each variable ends up loading on a single factor, we have a simple structure.
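One way to check this is to count, for each variable, how many loadings exceed the cutoff and which factor dominates. The sketch below copies four rows from the loadings printed above (only rows where all three values are shown); the object names are hypothetical.

```r
# Loadings copied from the printed output (rows with all three values shown)
L <- matrix(c(0.275, 0.889, 0.105,
              0.195, 0.348, 0.498,
              0.752, 0.446, 0.334,
              0.545, 0.308, 0.502),
            ncol = 3, byrow = TRUE,
            dimnames = list(c("GDP", "Freedom", "Functioning", "Politicalculture"),
                            c("MR1", "MR3", "MR2")))

cutoff    <- 0.51
n_salient <- rowSums(abs(L) > cutoff)                 # loadings above the cutoff
dominant  <- colnames(L)[apply(abs(L), 1, which.max)] # factor with the largest loading

data.frame(dominant, n_salient, row.names = rownames(L))
```

A variable with exactly one salient loading fits the simple structure; Freedom has none at this cutoff, which is why its row is blank in the second print.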
fa.diagram(resfa)
resfa$crms
## [1] 0.04098225
resfa$RMSEA
## RMSEA lower upper confidence
## 0.09565491 0.06006255 0.12401922 0.90000000
resfa$TLI
## [1] 0.9444927
sort(resfa$communality)
## Generosity Freedom Perceptions
## 0.3629227 0.4067030 0.5363506
## Politicalparticipation Politicalculture Social
## 0.6351970 0.6440077 0.6489500
## Healthy Functioning GDP
## 0.8242328 0.8751018 0.8776003
## Electoral Civilliberties
## 0.9086463 0.9436943
sort(resfa$complexity)
## Electoral Generosity GDP
## 1.066692 1.114800 1.219569
## Civilliberties Perceptions Healthy
## 1.250343 1.291734 1.397214
## Social Politicalparticipation Functioning
## 1.423394 1.435972 2.063637
## Freedom Politicalculture
## 2.134773 2.580199
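Both quantities can be recovered from the loadings: the communality of a variable is the sum of its squared loadings, and the complexity reported by psych is Hofmann's index, the squared sum of squared loadings divided by the sum of fourth powers. A small sketch using three rows of the rounded loadings printed earlier (object names are hypothetical); the results agree with the values above up to rounding.

```r
# Rounded loadings copied from the printed output
L <- matrix(c(0.275, 0.889, 0.105,    # GDP
              0.346, 0.829, 0.133,    # Healthy
              0.545, 0.308, 0.502),   # Politicalculture
            ncol = 3, byrow = TRUE,
            dimnames = list(c("GDP", "Healthy", "Politicalculture"),
                            c("MR1", "MR3", "MR2")))

communality <- rowSums(L^2)                  # variance explained by the factors
complexity  <- rowSums(L^2)^2 / rowSums(L^4) # 1 means the variable loads on one factor only

round(cbind(communality, complexity), 3)
```

A complexity close to 1 goes with a clean assignment to one factor, while values above 2, as for Politicalculture, signal a variable split across factors.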
What names would you give them?
as.data.frame(resfa$scores)
## MR1 MR3 MR2
## 1 -0.448363338 -1.505210390 -1.26233138
## 2 0.435857546 0.041076923 -0.77532536
## 3 -1.012984447 0.499826649 -0.86436092
## 4 0.727110953 0.478631634 -0.88481574
## 5 -0.108905556 0.129521455 -1.08047508
## 6 1.209836251 0.743904699 1.52266196
## 7 0.868921793 0.890280754 0.61984998
## 8 -1.517245634 0.776560237 -0.44027691
## 9 -1.957444919 1.497142249 0.25276075
## 10 0.147990329 -0.729674677 0.13247415
## 11 -1.633266434 0.913766984 -0.06376098
## 12 0.795221608 0.874883702 0.54820224
## 13 0.492006837 -1.767857078 0.35658723
## 14 -0.044475529 -0.513628262 0.90105909
## 15 0.419083675 -0.235354652 -0.71386360
## 16 0.055377983 0.210544432 -1.26810322
## 17 1.150109575 -0.363216027 -0.16077381
## 18 0.899102011 0.164147146 -0.95896609
## 19 0.822482531 0.444349039 -1.28198545
## 20 0.007435514 -1.448225162 0.35246166
## 21 -1.126408899 -1.861885744 0.47659243
## 22 -1.153980334 -0.285016060 1.09156940
## 23 -0.904823690 -0.986876367 -0.05874764
## 24 1.170021390 0.767225839 1.69127178
## 25 -0.812966741 -2.566557433 -0.64467300
## 26 -1.230289989 -1.450916081 -0.35637621
## 27 1.097595618 0.369968905 0.16112343
## 28 -1.981924509 1.026919761 0.74173876
## 29 0.829310347 0.231463766 -0.60509420
## 30 -0.285485991 -1.472150862 -0.30908408
## 31 -0.907035332 -0.550827260 -0.42438306
## 32 -1.487855906 -1.341784474 -0.10137891
## 33 1.055278202 0.369249685 0.13260399
## 34 0.533964447 0.540013827 -1.07365035
## 35 0.835450512 0.753142841 -0.52147656
## 36 0.818697766 0.805003929 -0.77095245
## 37 1.023434499 0.717208346 2.06028983
## 38 0.529713990 0.202810614 -0.60589467
## 39 0.438622982 0.184827603 -0.63522772
## 40 -1.027360811 0.140042715 -0.79702881
## 41 0.612234049 -0.165745769 -1.15079920
## 42 0.891271647 0.601147787 0.19490127
## 43 -1.161624135 -0.825287830 0.79308548
## 44 1.142433873 0.718269340 1.54074059
## 45 0.814228682 0.980854581 -0.24209701
## 46 -1.057407386 0.381134990 -0.88466688
## 47 -0.317698571 -1.491959508 1.02794162
## 48 0.128275369 -0.302261454 -0.84245526
## 49 1.059970335 0.745807790 0.98428511
## 50 0.675499683 -1.215043577 0.15834910
## 51 0.955371318 0.538443322 -1.59547812
## 52 0.298729066 -0.160906278 -0.29310414
## 53 -0.770403810 -1.299951253 -0.35768600
## 54 0.421900775 -1.884678012 0.34853125
## 55 0.375992213 -0.379959579 -0.26130280
## 56 -0.191989570 1.393057194 1.11503886
## 57 0.508186625 0.542620323 -1.02964948
## 58 1.191684960 0.843910136 1.30567174
## 59 0.928298988 -0.920366429 0.17278564
## 60 0.169401302 -0.271919022 1.00517028
## 61 -1.918430002 0.704267840 0.19669459
## 62 -0.752631206 0.027771812 -1.19020040
## 63 1.091541916 0.886678147 1.44845839
## 64 0.315454449 0.865213041 0.31818437
## 65 0.826095116 0.906123167 -0.94441159
## 66 -0.332403393 -1.366101111 -0.01105133
## 67 0.913630851 0.009196477 -0.29750647
## 68 0.760457539 1.065050330 0.03608209
## 69 -0.924291519 0.325787555 -0.04775219
## 70 -1.625819305 1.092595311 -0.34753866
## 71 -0.338953280 -0.857289178 1.02952608
## 72 -1.369464604 1.470318666 -0.28748581
## 73 -0.057190086 -0.416671181 -0.33003668
## 74 -1.765262163 -0.147463519 1.04518517
## 75 0.992328923 0.397587038 -0.90903761
## 76 -0.615168503 0.452174054 -0.89173688
## 77 1.013915456 -1.740540012 -0.26654311
## 78 0.559741792 -1.987727856 -0.22473088
## 79 -1.685392689 0.640420878 -0.34140852
## 80 1.036914460 0.562279534 -1.13787622
## 81 0.997262541 1.121924242 1.13711348
## 82 0.206206140 -1.417934823 -0.35400927
## 83 0.460784483 -1.891468342 0.49034840
## 84 0.127231046 0.513200069 0.35487074
## 85 0.578732915 -1.604991834 -0.31529628
## 86 0.832783702 0.700639168 1.21556096
## 87 -0.419067857 -0.652822448 -0.72523915
## 88 1.122836735 0.158983330 0.51704529
## 89 0.254077411 0.530602989 -0.97509381
## 90 0.548300518 -0.329326546 -0.95600322
## 91 0.698979923 -0.111025762 -0.59791742
## 92 0.162140653 0.509746727 -0.79198247
## 93 -0.438651312 -0.095484654 -0.27517566
## 94 -0.542094730 -1.497357080 0.48449961
## 95 -1.090087704 -0.578804792 1.84171690
## 96 0.502764782 -0.325029762 -0.44593606
## 97 -0.048142619 -0.736893894 0.77241914
## 98 0.989789360 0.783568753 1.59741701
## 99 1.271741248 0.600193003 1.87806075
## 100 -0.955774747 0.156842049 0.06481200
## 101 -0.026427854 -1.898223536 -0.38834364
## 102 -0.153029770 -1.022286876 -0.15761003
## 103 1.151388226 0.930239431 1.93447917
## 104 -0.116901106 -0.765787620 -0.23018601
## 105 -0.599161598 -0.187792395 -0.83996073
## 106 0.697996503 0.617509032 -0.84602967
## 107 0.603588554 -0.012737087 -0.41904362
## 108 0.661819881 0.179321459 -0.99234193
## 109 0.695707569 -0.344026505 -0.39053804
## 110 0.559053306 0.704913325 -0.95019389
## 111 0.940992323 0.820458025 -0.64421204
## 112 -1.889479925 1.941034269 0.62972432
## 113 0.573369863 0.478039571 -1.24429827
## 114 -1.393160649 1.079538842 -1.41108860
## 115 -1.039580742 -1.052058551 2.17771827
## 116 -2.270615142 1.630504363 -0.37030869
## 117 0.655796018 -1.142008108 0.12234784
## 118 0.563980216 0.296919014 -0.97006374
## 119 0.297328958 -2.023599468 -0.17062056
## 120 -0.351480225 1.649827264 1.91497597
## 121 0.746184523 0.696133598 -0.95566057
## 122 0.708650402 0.840023327 -0.41732062
## 123 0.831985970 -0.335620014 -0.20120855
## 124 0.882232794 0.723735446 -0.35759723
## 125 0.880442528 0.926887655 -0.30460694
## 126 0.217172341 0.118068727 0.02804046
## 127 1.061431795 0.702282525 2.14207421
## 128 0.917615645 0.939304546 1.81019765
## 129 -1.881972055 -0.946463679 0.15327054
## 130 1.011745087 0.737206608 -0.20300808
## 131 -1.916101434 -0.252043497 0.71549347
## 132 0.088056964 -1.308498166 0.75571207
## 133 -0.617739539 0.669607203 0.25882623
## 134 -0.688909894 -1.546058514 -0.22552850
## 135 0.647817400 0.416648073 -0.42899140
## 136 0.192572939 0.040005697 -0.57629892
## 137 -1.098641190 0.866713628 -0.35353389
## 138 -2.269060367 0.875261230 -0.17607996
## 139 0.227994072 -1.444571636 0.37464732
## 140 0.164098634 -0.072143160 -0.85650947
## 141 -2.068040244 1.614321935 0.86095558
## 142 1.002311633 0.658972104 1.13401659
## 143 0.712520417 0.787703995 0.32667956
## 144 1.252890503 0.330420021 0.36019307
## 145 -2.133188960 0.419666632 1.39928036
## 146 -1.226464789 0.754504485 -1.13953813
## 147 -1.647582289 0.573472513 0.47816425
## 148 -1.620179826 -0.849559396 -0.32581892
## 149 0.345289273 -1.159620886 0.50255178
## 150 -1.065366698 -0.900876793 0.33166415
HappyDemoFA=cbind(HappyDemo[1],as.data.frame(resfa$scores))
library(plotly)
plot_ly(data=HappyDemoFA, x = ~MR1, y = ~MR2, z = ~MR3, text=~Country) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Demo'),
yaxis = list(title = 'Tranquility'),
zaxis = list(title = 'Well-being')))
RECALL:
library(fpc)
library(cluster)
library(dbscan)
# YOU NO LONGER NEED CMD for HappyDemoFA[,c(2:4)]
g.dist.cmd = daisy(HappyDemoFA[,c(2:4)], metric = 'euclidean')
kNNdistplot(g.dist.cmd, k=3)
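The plot shows, for every point, the distance to its k-th nearest neighbor in increasing order; the "knee" of the curve is a common candidate for eps. A base R sketch in the spirit of `kNNdistplot`, on simulated points (the data and names are hypothetical):

```r
set.seed(4)
# Simulated scores (hypothetical): 60 points in three dimensions
X <- matrix(rnorm(60 * 3), ncol = 3)
k <- 3

D <- as.matrix(dist(X))   # Euclidean distance between every pair of points

# In each sorted row, position 1 is the point itself (distance 0),
# so the k-th neighbor sits at position k + 1
knn_dist <- apply(D, 1, function(d) sort(d)[k + 1])

plot(sort(knn_dist), type = "l",
     xlab = "Points sorted by distance", ylab = "3-NN distance")
```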
To get an idea of each cluster (in the fpc::dbscan result, cluster 0 collects the noise points):
resDB=fpc::dbscan(g.dist.cmd, eps=0.6, MinPts=3,method = 'dist')
HappyDemoFA$clustDB=as.factor(resDB$cluster)
aggregate(cbind(MR1, MR2,MR3) # dependent variables
~ clustDB, # grouping level
data = HappyDemoFA, # data
max) # operation
## clustDB MR1 MR2 MR3
## 1 0 -0.1919896 2.1777183 1.6498273
## 2 1 1.2717412 2.1420742 1.1219242
## 3 2 -1.0986412 0.4781642 1.4703187
## 4 3 -0.6177395 0.2588262 0.6696072
## 5 4 -1.8894799 0.8609556 1.9410343
plot_ly(data=HappyDemoFA, x = ~MR1, y = ~MR2, z = ~MR3, text=~Country, color = ~clustDB) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Demo'),
yaxis = list(title = 'Tranquility'),
zaxis = list(title = 'Well-being')))
This concludes Unit II; confirmatory factor analysis will be covered in the next Unit.